ABSTRACT


Myanmar script has no fixed delimiters between words or syllables, so
segmenting text into meaningful, correct words is a challenging task. This
paper proposes a morpheme-based Myanmar word tokenizer that combines
rule-based syllable breaking with dictionary-lookup syllable merging using
the longest string matching approach. The proposed approach is tested with a
monolingual dictionary that contains information useful for word
segmentation. The dictionary holds 32,581 words, including headwords, stop
words and essential words, in the MyanmarS font. These words were collected
from Myanmar and Essential Words dictionaries. According to the experimental
results, the approach provides promising segmentation accuracy on Myanmar
text.
INTRODUCTION

Word segmentation is a prerequisite for Myanmar language processing tasks
such as part-of-speech (POS) tagging, search, translation, information
retrieval, word sense disambiguation and many other Natural Language
Processing (NLP) activities. Currently, there is no dictionary-based Myanmar
word segmentation approach that works at the morpheme level. A morpheme
represents the root of a specific word. Given the nature of the Myanmar
language, morphemes play a vital role in the machine translation of Myanmar
text: segmenting into morphemes makes translation of Myanmar text easier.
For the Myanmar language, there are no statistical corpus
resources or training data with which to test a word
segmentation algorithm.

In this paper, we propose a word segmentation approach
that does not rely on statistical methods or a corpus.
This approach is useful when linguistic resources such as
corpora are not available for the Myanmar language.
We build a monolingual lexicon of morpheme-level Myanmar
words collected from Myanmar and Essential Words
dictionaries. Syllable tokenization serves as preprocessing:
the input sentence is segmented into syllables using rules
on the syllable structure of Myanmar script. To determine
word boundaries over the segmented syllables, the proposed
approach applies forward longest matching against the
dictionary. The system segments the input sentence into
morpheme-based Myanmar words by comparing the input
string, character by character, with the monolingual
dictionary. The approach is simple, but it proves practical
when applicable linguistic resources are not available.

RELATED WORK 

In this section, previous work on Myanmar word
segmentation is reviewed. Win Pa Pa and NiLar Thein
studied disambiguation in Myanmar word segmentation.
The authors resolved word ambiguity by combining Forward
Maximum Matching, Backward Maximum Matching and joint
entropy, using a statistical approach with a corpus. They
reported a word segmentation precision of 92% and a recall
of 94%. Tun Thura Thet, Jin-Cheon Na and Wunna Ko Ko
proposed a word segmentation method for the Myanmar
language. They applied rule-based syllable segmentation
followed by dictionary-based statistical syllable merging to
handle word ambiguity, combined with mutual information
computed as collocation strength over a corpus. They
reported a precision of 98.94%, a recall of 99.05% and an
F-measure of 98.99%.

"Myanmar Word Segmentation using Syllable level Longest
Matching" was proposed by Hla Hla Htay and Kavi Narayana
Murthy. They used a word list of more than 800,000 words,
including inflected forms. The authors first applied stop
word removal and then used an n-gram approach for syllable
matching. They achieved a recall of 98.81%, a precision of
99.11% and an F-measure of 98.95%, tested at the sentence
level on text collected from web documents, grammar books
and stories.

MYANMAR LANGUAGE 

The Myanmar language is the official language of the Republic
of the Union of Myanmar and is more than one thousand years
old. Texts in the Myanmar language use the Myanmar script,
which is descended from the Brahmi script of ancient South
India.

A. Myanmar Script 

A Myanmar text is a string of characters without explicit
word boundary markup, written in sequence from left to
right. Myanmar script contains 33 consonants, 8 vowels
(free-standing and attached), 2 diacritics, 11 medials, a
vowel killer or ASAT, 10 digits and 2 punctuation marks [4].

B. Syllable Breaking in Myanmar Text 

Syllable breaking is the process of identifying syllable
boundaries in a text. The syllable is the smallest unit of
the language. In Myanmar text, a syllable can start with a
consonant, which may be followed by a medial consonant.
After the vowel, a syllable may end with nasalization of the
vowel or an unreleased glottal stop. At the end of a syllable,
a final consonant usually has an 'asat' sign above it to show
that there is no inherent vowel. In multisyllabic words
derived from an Indian language such as Pali, where two
consonants occur internally with no intervening vowel, the
consonants tend to be stacked vertically and the asat sign is
not used. There is also a set of Myanmar numerals, which are
used just like Latin digits [2].

Syllable segmentation is done first, using rules on the
syllable structure of the Myanmar script. The syllable
breaking rules cover combining consonant and vowel,
devowelizing and kinzi, contractions, syllable chaining,
distinct letters, single characters and loan words. In the
syllable breaking stage, the proposed system determines a
syllable boundary by comparing pairs of characters to decide
whether a break is possible between them. The accuracy
results of syllable segmentation are reported in Table I.
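
The pairwise break check described above can be illustrated with a minimal
sketch. It is not the authors' rule set: it assumes Unicode-encoded input
(the paper's dictionary uses the MyanmarS font encoding), and it covers only
the asat, stacked-consonant and kinzi cases. The names `break_syllables`,
`CONSONANTS`, `ASAT` and `VIRAMA` are hypothetical. A break is placed before
a consonant unless that consonant is followed by asat or the stacking sign,
or is itself stacked under the preceding character.

```python
# Assumption: Unicode Myanmar input; consonant letters U+1000 .. U+1021.
CONSONANTS = {chr(c) for c in range(0x1000, 0x1022)}
ASAT = "\u103a"    # vowel killer
VIRAMA = "\u1039"  # stacking sign, used for stacked consonants and kinzi


def break_syllables(text):
    """Split text into syllables by checking each character against its neighbours."""
    syllables = []
    current = ""
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # A consonant opens a new syllable unless it is a final consonant
        # (followed by asat), the upper part of a stack (followed by the
        # stacking sign), or the lower part of a stack (preceded by it).
        starts_syllable = (
            ch in CONSONANTS
            and prev_ch != VIRAMA
            and next_ch not in (ASAT, VIRAMA)
        )
        if starts_syllable and current:
            syllables.append(current)
            current = ""
        current += ch
    if current:
        syllables.append(current)
    return syllables


# The word "Myanmar" (ma, medial ra, na, asat, ma, aa) splits into two syllables.
print(break_syllables("\u1019\u103c\u1014\u103a\u1019\u102c"))
```

Independent vowels, digits, punctuation and loan words would need further
rules of the kind listed in the text; they are left out here to keep the
sketch short.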


MYANMAR WORD SEGMENTATION 

Word segmentation is the process of parsing concatenated
text (i.e. text that contains no spaces or other word
separators) to infer where word breaks exist. Myanmar
script does not require white space between words or
syllables. Modern writing style places spaces after each
clause to enhance readability [5]. Generally, a word
is a basic unit of language that carries meaning and can be
spoken or written. A Myanmar word can consist of one or
more morphemes that are linked more or less tightly
together.

[Myanmar-script example]

A Myanmar word consists of a root or stem and zero or more
affixes.

Example: [Myanmar-script example]

Moreover, Myanmar words can be combined to form
phrases, clauses and sentences.

[Myanmar-script example]

In addition, a word consisting of two or more stems joined
together is known as a compound word [3].

[Myanmar-script example]


1. Combining consonant and vowel

consonant + vowel

[Myanmar-script example]

2. Devowelizing and Kinzi

consonant + consonant + asat

consonant + kinzi + consonant + medial consonant + asat

[Myanmar-script example]

3. Syllable Chaining

consonant + vowel + consonant + syllable chain + consonant + vowel

[Myanmar-script example]

4. Single Character

consonant + vowel + single character + vowel

[Myanmar-script example]

5. Contraction

consonant + medial + vowel + consonant + asat + medial + vowel + tone mark

[Myanmar-script example]

The next step is to merge the segmented syllables into
meaningful words. Syllable merging is done with the longest
matching approach against the lexicon. The algorithm starts
from the beginning of a sentence, finds the longest word
that matches an entry in the lexicon, and repeats the
process until it reaches the end of the sentence. In this
way the system segments the input sentence into
morpheme-based words by comparing the input string,
character by character, with the monolingual dictionary.
The process of word segmentation is shown in Figure 1. The
system is tested on all types of simple and complex Myanmar
sentences, including sentences with one or more clauses and
phrases. The accuracy results are given in Table II. The
proposed system has some problems in syllable merging:
because of the longest matching approach, it cannot give the
correct segmentation of every word in the input sentence,
and segmentation conflicts can arise for some words in a
sentence.

Example: [Myanmar-script example]

With the longest matching approach, this sentence is
incorrectly segmented into

[Myanmar-script example]
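
The forward longest-matching merge described above can be sketched as
follows. This is an illustration under stated assumptions, not the authors'
implementation: it matches runs of whole syllables rather than single
characters, the function name `merge_syllables` and the `max_syllables`
limit are hypothetical, and toy Latin "syllables" stand in for Myanmar ones.
The second half of the example reproduces the kind of segmentation conflict
noted above, where the greedy choice blocks a better split.

```python
def merge_syllables(syllables, lexicon, max_syllables=6):
    """Greedy forward longest matching of syllable runs against a lexicon."""
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_syllables, len(syllables) - i)
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(longest, 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single unmatched syllable is emitted as-is so the scan advances.
            if candidate in lexicon or n == 1:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon; real entries would be Myanmar morphemes.
toy_lexicon = {"ab", "abc", "cd"}
print(merge_syllables(["a", "b", "c", "d"], toy_lexicon))
# Greedy result: ['abc', 'd'], although 'ab' + 'cd' uses only lexicon words.
```

Because the match is committed as soon as the longest entry is found, a
later, better split is never reconsidered; resolving such conflicts would
need backtracking or statistical information, which is outside this sketch.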


Fig. 1. Process of Word Segmentation: Sentences -> Break sentence into
syllables (Rules) -> Merge the syllables into words (Dictionary) ->
Segmented Words

The structure of a sentence in the Myanmar language may be
simple, compound or complex. Generally, a sentence is
subdivided into phrases, a phrase into words, and a word
into syllables. A syllable is the smallest unit of the
language [3]. Either a simple or a compound sentence can
contain one or more phrases and one or more clauses. A group
of words that makes sense, but not complete sense, is called
a phrase. It is a group of related words without a subject
and a verb. Examples: in the east, on a wall, with blue
trimming, on the bridge, with red hair [2]. A clause is a
group of words that contains both a subject and a predicate
but cannot always be considered a full grammatical sentence.
Clauses can be either independent clauses (also called main
clauses) or dependent clauses (also called subordinate
clauses) [2]. Like an English sentence, a Myanmar sentence is
composed of one or more clauses and phrases.


1. Examples of adding adjective and adverb phrases in a
simple sentence

[Myanmar-script example]

2. Examples of adding phrases in a simple sentence

[Myanmar-script example]

3. Examples of adding time phrases in a simple sentence

[Myanmar-script example]

4. Examples of adding accusative phrases in a simple
sentence

[Myanmar-script example]

5. Examples of a compound sentence with a dependent
clause and an independent clause

[Myanmar-script example: dependent clause, independent clause]

6. Examples of a compound sentence with three clauses

[Myanmar-script example]

7. Examples of a sentence with a hidden object

[Myanmar-script example]

8. Examples of a sentence with the positions of subject
and reason changed

[Myanmar-script example]

EXPERIMENT RESULTS 

Table I and Table II show the experimental results of the
word segmentation system for syllable breaking and syllable
merging, respectively. Syllable breaking reaches 100%
accuracy.


TABLE I. Accuracy Results on Syllable Segmentation

Syllable Type     NCseg   NTseg   Accuracy
Unique Syllable   1903    1903    100%
Tokens            7069    7069    100%
Sentence          1226    1226    100%

Accuracy = NCseg / NTseg * 100

NCseg = the number of syllables correctly segmented by the
program on the input

NTseg = the total number of segmented syllables verified
manually

TABLE II. Accuracy Results on Word Segmentation

Syllable Type     NTmg    NCmg    Accuracy
Unique Syllable   7069    6769    95.77%
Tokens            1226    926     75.53%

Accuracy = NCmg / NTmg * 100

NCmg = the number of syllables correctly merged by the
program on the input

NTmg = the total number of merged syllables verified manually

The tested dictionary contains 32,581 tokens. The test
sentences cover all sentence types (simple, compound and
complex), including complex sentences with one, two and
three clauses.
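
The accuracy measure used in the tables is a plain ratio of correct to
total items, expressed as a percentage. A minimal sketch, assuming the
counts are available as integers (the function name `accuracy` is
hypothetical):

```python
def accuracy(n_correct, n_total):
    """Return the percentage of correct items: n_correct / n_total * 100."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total * 100


# Unique syllables in Table I: 1903 correct out of 1903 verified.
print(accuracy(1903, 1903))
```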

CONCLUSION 

This paper has proposed an approach to Myanmar word
segmentation that uses rule-based syllable breaking and
dictionary-lookup syllable merging. In the syllable
breaking stage, the proposed system determines a syllable
boundary by comparing pairs of characters to decide whether
a break is possible between them. It then merges the
segmented syllables into meaningful words using dictionary
lookup with the longest string matching algorithm. The
proposed system produces morpheme-based Myanmar words from
the input sentence. It can segment written Myanmar
sentences containing one or more phrases and clauses,
including simple and compound sentences with one or more
dependent and independent clauses. It can therefore benefit
Myanmar-to-English translation systems and further NLP
tasks such as information retrieval, noun phrase
identification, verb phrase identification, named entity
recognition, word sense disambiguation and many other NLP
activities.